minimax risk
Blind Attacks on Machine Learners
Alex Beatson, Zhaoran Wang, Han Liu
The importance of studying the robustness of learners to malicious data is well established. While much work has been done establishing both robust estimators and effective data injection attacks when the attacker is omniscient, the ability of an attacker to provably harm learning while having access to little information is largely unstudied. We study the potential of a "blind attacker" to provably limit a learner's performance by data injection attack without observing the learner's training set or any parameter of the distribution from which it is drawn. We provide examples of simple yet effective attacks in two settings: firstly, where an "informed learner" knows the strategy chosen by the attacker, and secondly, where a "blind learner" knows only the proportion of malicious data and some family to which the malicious distribution chosen by the attacker belongs. For each attack, we analyze minimax rates of convergence and establish lower bounds on the learner's minimax risk, exhibiting limits on a learner's ability to learn under data injection attack even when the attacker is "blind".
Total Variation Classes Beyond 1d: Minimax Rates, and the Limitations of Linear Smoothers
Veeranjaneyulu Sadhanala, Yu-Xiang Wang, Ryan J. Tibshirani
We consider the problem of estimating a function defined over $n$ locations on a $d$-dimensional grid (having all side lengths equal to $n^{1/d}$). When the function is constrained to have discrete total variation bounded by $C_n$, we derive the minimax optimal (squared) $\ell_2$ estimation error rate, parametrized by $n, C_n$. Total variation denoising, also known as the fused lasso, is seen to be rate optimal. Several simpler estimators exist, such as Laplacian smoothing and Laplacian eigenmaps. A natural question is: can these simpler estimators perform just as well?
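To make the constraint concrete, here is a minimal sketch of discrete total variation on a 2d grid, assuming the usual grid-graph definition (the sum of absolute differences between horizontally and vertically adjacent locations). The function name `discrete_total_variation` and the example grid are illustrative and not taken from the paper.

```python
def discrete_total_variation(grid):
    """Sum of |difference| over all edges of a 2d grid graph.

    Assumption: `grid` is a non-empty list of equal-length rows, and
    neighbours are the horizontally and vertically adjacent cells.
    """
    rows, cols = len(grid), len(grid[0])
    tv = 0.0
    for i in range(rows):
        for j in range(cols):
            if i + 1 < rows:  # edge to the cell below
                tv += abs(grid[i + 1][j] - grid[i][j])
            if j + 1 < cols:  # edge to the cell on the right
                tv += abs(grid[i][j + 1] - grid[i][j])
    return tv


# A piecewise-constant function with a single jump of height 1
# crossing three horizontal edges.
step = [
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
]
print(discrete_total_variation(step))  # 3.0
```

A function satisfies the constraint in the abstract when this quantity is at most $C_n$. Piecewise-constant functions with short boundaries, like `step`, have small total variation even though they are discontinuous.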
Information-theoretic Limits of Online Classification with Noisy Labels
We study online classification with general hypothesis classes where the true labels are determined by some function within the class, but are corrupted by stochastic noise, and the features are generated adversarially. Predictions are made using observed noisy labels and noiseless features, while performance is measured via minimax risk when comparing against the true labels. The noise mechanism is modeled via a general noise kernel that specifies, for any individual data point, a set of distributions from which the actual noisy label distribution is chosen. We show that minimax risk is characterized (up to a logarithmic factor of the hypothesis class size) by the Hellinger gap of the noisy label distributions induced by the kernel, independent of other properties such as the means and variances of the noise. Our main technique is based on a novel reduction to an online comparison scheme of two hypotheses, along with a new version of Le Cam-Birgé testing suitable for online settings. Our work provides the first comprehensive characterization of noisy online classification with guarantees that apply to the ground truth while addressing noisy observations.
Towards Sharp Minimax Risk Bounds for Operator Learning
Ben Adcock, Gregor Maier, Rahul Parhi
A new paradigm in machine learning for scientific computing is focused on designing learning algorithms and methods for continuum problems. This paradigm is referred to as operator learning and has received considerable interest in the last few years [5,7,18,20,23-25,27,30,34,36]. The basic task may be posed as learning a map between infinite-dimensional function spaces, i.e., learning an operator $F: X \to Y$, where, for example, $X$ and $Y$ are real, separable Hilbert spaces. Operator learning naturally arises in many scientific problems where one wants to learn how a continuum model, often described by partial differential equations (PDEs), maps inputs, such as parameters or boundary conditions, to outputs, such as states or observables. A prototypical example to keep in mind is learning parameter-to-solution maps of parametric PDEs [1,2,11]. In contrast to more classical surrogate modeling, which typically focuses on learning finite-dimensional parameter-to-solution maps for some fixed discretization, operator learning directly aims to learn/approximate the continuum map $F: X \to Y$ itself. Thus, the inputs and outputs are functions (not vectors) and the goal is to directly design discretization-invariant methods [7,23]. From a statistical perspective, this naturally leads to a nonparametric regression problem in which both the object of interest (the operator) and the observations (a finite number of noisy samples) are infinite-dimensional.
Learning the score under shape constraints
Rebecca M. Lewis, Oliver Y. Feng, Henry W. J. Reeve, Min Xu, Richard J. Samworth
Score estimation has recently emerged as a key modern statistical challenge, due to its pivotal role in generative modelling via diffusion models. Moreover, it is an essential ingredient in a new approach to linear regression via convex $M$-estimation, where the corresponding error densities are projected onto the log-concave class. Motivated by these applications, we study the minimax risk of score estimation with respect to squared $L^2(P_0)$-loss, where $P_0$ denotes an underlying log-concave distribution on $\mathbb{R}$. Such distributions have decreasing score functions, but on its own, this shape constraint is insufficient to guarantee a finite minimax risk. We therefore define subclasses of log-concave densities that capture two fundamental aspects of the estimation problem. First, we establish the crucial impact of tail behaviour on score estimation by determining the minimax rate over a class of log-concave densities whose score function exhibits controlled growth relative to the quantile levels. Second, we explore the interplay between smoothness and log-concavity by considering the class of log-concave densities with a scale restriction and a $(β,L)$-Hölder assumption on the log-density for some $β\in [1,2]$. We show that the minimax risk over this latter class is of order $L^{2/(2β+1)}n^{-β/(2β+1)}$ up to poly-logarithmic factors, where $n$ denotes the sample size. When $β< 2$, this rate is faster than could be obtained under either the shape constraint or the smoothness assumption alone. Our upper bounds are attained by a locally adaptive, multiscale estimator constructed from a uniform confidence band for the score function. This study highlights intriguing differences between the score estimation and density estimation problems over this shape-constrained class.
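As a quick worked instance of the rate stated above (plain arithmetic on the exponents, ignoring the poly-logarithmic factors), substituting the two endpoints of the range $β\in [1,2]$ into $L^{2/(2β+1)}n^{-β/(2β+1)}$ gives:

```latex
\beta = 1:\quad L^{2/3}\, n^{-1/3},
\qquad
\beta = 2:\quad L^{2/5}\, n^{-2/5}.
```

The exponent of $n$ improves from $-1/3$ to $-2/5$ as the smoothness $β$ increases from 1 to 2, while the dependence on the Hölder constant $L$ weakens from $L^{2/3}$ to $L^{2/5}$.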
Performance Guarantees for Quantum Neural Estimation of Entropies
Sreejith Sreekumar, Ziv Goldfeld, Mark M. Wilde
Estimating quantum entropies and divergences is an important problem in quantum physics, information theory, and machine learning. Quantum neural estimators (QNEs), which utilize a hybrid classical-quantum architecture, have recently emerged as an appealing computational framework for estimating these measures. Such estimators combine classical neural networks with parametrized quantum circuits, and their deployment typically entails tedious tuning of hyperparameters controlling the sample size, network architecture, and circuit topology. This work initiates the study of formal guarantees for QNEs of measured (Rényi) relative entropies in the form of non-asymptotic error risk bounds. We further establish exponential tail bounds showing that the error is sub-Gaussian, and thus sharply concentrates about the ground truth value. For an appropriate sub-class of density operator pairs on a space of dimension $d$ with bounded Thompson metric, our theory establishes a copy complexity of $O(|Θ(\mathcal{U})|d/ε^2)$ for QNE with a quantum circuit parameter set $Θ(\mathcal{U})$, which has minimax optimal dependence on the accuracy $ε$. Additionally, if the density operator pairs are permutation invariant, we improve the dimension dependence above to $O(|Θ(\mathcal{U})|\mathrm{polylog}(d)/ε^2)$. Our theory aims to facilitate principled implementation of QNEs for measured relative entropies and guide hyperparameter tuning in practice.